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Abstract 

We propose a second-order (Hessian or Hessian-free) based optimization method 
for variational inference inspired by Gaussian backpropagation, and argue that 
quasi-Newton optimization can be developed as well. This is accomplished by 
generalizing the gradient computation in stochastic backpropagation via a repa¬ 
rameterization trick with lower complexity. As an illustrative example, we ap¬ 
ply this approach to the problems of Bayesian logistic regression and variational 
auto-encoder (VAE). Additionally, we compute bounds on the estimator variance 
of intractable expectations for the family of Lipschitz continuous function. Our 
method is practical, scalable and model free. We demonstrate our method on sev¬ 
eral real-world datasets and provide comparisons with other stochastic gradient 
methods to show substantial enhancement in convergence rates. 


1 Introduction 

Generative models have become ubiquitous in machine learning and statistics and are now widely 
used in fields such as bioinformatics, computer vision, or natural language processing. These models 
benefit from being highly interpretable and easily extended. Unfortunately, inference and learning 
with generative models is often intractable, especially for models that employ continuous latent 
variables, and so fast approximate methods are needed. Variational Bayesian (VB) methods O deal 
with this problem by approximating the true posterior that has a tractable parametric form and then 
identifying the set of parameters that maximize a variational lower bound on the marginal likelihood. 
That is, VB methods turn an inference problem into an optimization problem that can be solved, for 
example, by gradient ascent. 

Indeed, efficient stochastic gradient variational Bayesian (SGVB) estimators have been devel¬ 
oped for auto-encoder models El and a number of papers have followed up on this approach 
GlllSKIllIllBlEillol. Recently, 1^ provided a complementary perspective by using stochastic 
backpropagation that is equivalent to SGVB and applied it to deep latent gaussian models. Stochas¬ 
tic backpropagation overcomes many limitations of traditional inference methods such as the mean- 
field or wake-sleep algorithms ifT^ due to the existence of efficient computations of an unbiased 
estimate of the gradient of the variational lower bound. The resulting gradients can be used for pa¬ 
rameter estimation via stochastic optimization methods such as stochastic gradient decent(SGD) or 
adaptive version (Adagrad) ii. 


* Equal contribution as the first author 
^Refer to Hong Kong University of Science and Technology 
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Unfortunately, methods such as SGD or Adagrad converge slowly for some difficult-to-train models, 
such as untied-weights auto-encoders or recurrent neural networks. The common experience is that 
gradient decent always gets stuck near saddle points or local extrema. Meanwhile the learning rate 
is difficult to tune. ca gave a clear explanation on why Newton’s method is preferred over gradient 
decent, which often encounters under-fitting problem if the optimizing function manifests patholog¬ 
ical curvature. Newton’s method is invariant to affine transformations so it can take advantage of 
curvature information, but has higher computational cost due to its reliance on the inverse of the 
Hessian matrix. This issue was partially addressed in ifT^ where the authors introduced Hessian 
free (HF) optimization and demonstrated its suitability for problems in machine learning. 

In this paper, we continue this line of research into 2^^ order variational inference algorithms. In¬ 
spired by the property of location scale families |l8l, we show how to reduce the computational 
cost of the Hessian or Hessian-vector product, thus allowing for a 2“^ order stochastic optimiza¬ 
tion scheme for variational inference under Gaussian approximation. In conjunction with the HF 
optimization, we propose an efficient and scalable 2^^ order stochastic Gaussian backpropagation 
for variational inference called HFSGVI. Alternately, L-BFGS lO version, a quasi-Newton method 
merely using the gradient information, is a natural generalization of order variational inference. 

The most immediate application would be to look at obtaining better optimization algorithms for 
variational inference. As to our knowledge, the model currently applying 2”^ order information is 
LDA Eiini, where the Hessian is easy to compute Km. In general, for non-linear factor models 
like non-linear factor analysis or the deep latent Gaussian models this is not the case. Indeed, 
to our knowledge, there has not been any systematic investigation into the properties of various 
optimization algorithms and how they might impact the solutions to optimization problem arising 
from variational approximations. 

The main contributions of this paper are to fill such gap for variational inference by introducing a 
novel 2^^ order optimization scheme. First, we describe a clever approach to obtain curvature infor¬ 
mation with low computational cost, thus making the Newton’s method both scalable and efficient. 
Second, we show that the variance of the lower bound estimator can be bounded by a dimension-free 
constant, extending the work of 1251 that discussed a specific bound for univariate function. Third, 
we demonstrate the performance of our method for Bayesian logistic regression and the VAE model 
in comparison to commonly used algorithms. Convergence rate is shown to be competitive or faster. 


2 Stochastic Backpropagation 

In this section, we extend the Bonnet and Price theorem El ED to develop 2“^ order Gaussian 
backpropagation. Specifically, we consider how to optimize an expectation of the form [/(z|x)], 
where z and x refer to latent variables and observed variables respectively, and expectation is taken 
w.r.t distribution qo and / is some smooth loss function (e.g. it can be derived from a standard 
variational lower bound ID). Sometimes we abuse notation and refer to /(z) by omitting x when no 
ambiguity exists. To optimize such expectation, gradient decent methods require the derivatives, 
while Newton’s methods require both the gradients and Hessian involving 2^^ order derivatives. 


2.1 Second Order Gaussian Backpropagation 


If the distribution is a (i;^ -dimensional Gaussian A/’(z|/L4, C), the required partial derivative is easily 
computed with a lower algorithmic cost 0{d‘l) f25\ . By using the property of Gaussian distribution, 
we can compute the 2^^ order partial derivative of [/(z)] as follows: 


[/(z)] 


— 2VcyEA^(z|ft,C)[/(z)]5 (1) 

2®Ar(z|^,C) Zi,Zk,zifi^)] ■ (3) 


Eq. Q, Q, ^ (proof in supplementary) have the nice property that a limited number of samples 
from q are sufficient to obtain unbiased gradient estimates. However, note that Eq. 0,0. needs 
to calculate the third and fourth derivatives of /(z), which is highly computationally inefficient. To 
avoid the calculation of high order derivatives, we use a co-ordinate transformation. 
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2.2 Covariance Parameterization for Optimization 


By constructing the linear transformation (a.k.a. reparameterization) z = Re, where e ~ 
A/’(0, IdJ, we can generate samples from any Gaussian distribution A/’(/L6, C) by simulating data 
from a standard normal distribution, provided the decomposition C = RR^ holds. This fact allows 
us to derive the following theorem indicating that the computation of 2”^ order derivatives can be 
scalable and programmed to run in parallel. 

Theorem 1 (Fast Derivative). If f is a twice differentiable function and z follows Gaussian dis¬ 
tribution C), C = RR^, where both the mean [jl and R depend on a d-dimensional pa¬ 
rameter 0 = i-e. we have ,REAr(M.c)[/(z)] = Ee^Ar(o.idj[e‘^ OH] , and 

ViiE^(^,c)[/(z)] = Ee-^Ar(o.i<ij[(ee^) ® Hj. This then implies 


V0,E^(^,c)[/(z)] 


Ee^A/'(0,I) 

Ee~A^(0,I) 


y ^{^l + Re) 

® J ’ 

d{iJi + Re) ^ d{iJi + Re) y + Re) 

ddiA^ 


where ( 8 ) is Kronecker product, and gradient g, Hessian H are evaluated at p-\- Re in terms of f(z). 

If we consider the mean and covariance matrix as the variational parameters in variational inference, 
the first two results w.r.t /L6, R make parallelization possible, and reduce computational cost of the 
Hessian-vector multiplication due to the fact that (^4^ (g) B)vec{V) = vec{AVB). If the model has 
few parameters or a large resource budget (e.g. GPU) is allowed, Theoreml^launches the foundation 
for exact 2“^ order derivative computation in parallel. In addition, note that the 2”^ order gradient 
computation on model parameter 6 only involves matrix-vector or vector-vector multiplication, thus 
leading to an algorithmic complexity that is 0{d‘l) for 2^^ order derivative of 6, which is the same 
as order gradient 1^ . The derivative computation at function / is up to 2“^ order, avoiding to 
calculate 3^^ or 4* order derivatives. One practical parametrization assumes a diagonal covariance 
matrix C = diagla^,..., }. This reduces the actual computational cost compared with Theorem 

M albeit the same order of the complexity (0{d‘l)) (see supplementary material). Theorem H holds 
for a large class of distributions in addition to Gaussian distributions, such as student t-distribution. 
If the dimensionality d of embedded parameter 9 is large, computation of the gradient and 
Hessian (differ from g, H above) will be linear and quadratic w.r.t d, which may be unacceptable. 
Therefore, in the next section we attempt to reduce the computational complexity w.r.t d. 


2.3 Apply Reparameterization on Second Order Algorithm 


In standard Newton’s method, we need to compute the Hessian matrix and its inverse, which is 
intractable for limited computing resources. CD applied Hessian-free (HF) optimization method in 
deep learning effectively and efficiently. This work largely relied on the technique of fast Hessian 
matrix-vector multiplication 1^ . We combine reparameterization trick with Hessian-free or quasi- 
Newton to circumvent matrix inverse problem. 


Hessian-free Unlike quasi-Newton methods HF doesn’t make any approximation on the Hessian. 
HF needs to compute H^v, where v is any vector that has the matched dimension to H^, and then 
uses conjugate gradient algorithm to solve the linear system H^v = — VF(0)^v, for any objective 
function F. csi gives a reasonable explanation for Hessian free optimization. In short, unlike a 
pre-training method that places the parameters in a search region to regularize 171, HF solves issues 
of pathological curvature in the objective by taking the advantage of rescaling property of Newton’s 
method. By definition H^v = lim^^o indicating that H^v can be numerically 

computed by using finite differences at 7. However, this numerical method is unstable for small 7. 


In this section, we focus on the calculation of H^v by leveraging a reparameterization trick. 
Specifically, we apply an 7^-operator technique 1231 f or co mputing the product H^v exactly. Let 
F = IEa/'(;x,c) [/(z)^and reparameterize z again as Sec. 


2.2 


we do variable substitution 9 ^ 0 + 7V 
after gradient Eq. Q is obtained, and then take derivative on 7. Thus we have the following analyt- 
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Algorithm 1 Hessian-free Algorithm on Stochastic Gaussian Variational Inference (HFSGVI) 

Parameters: Minibatch Size B, Number of samples to estimate the expectation M (= 1 as default), 
Input: Observation X (and Y if required), Lower bound function C = IEa/'(/x,c) [fc] 

Output: Parameter 6 after having converged. 

1: for t = 1, 2,... do 

2: ^ Randomly draw B datapoints from full data set X; 

3: ^mb=i ^ sample M times from A/'(0,1) for each x^; 

4: Define gradient G(6>) = jjT,b E™, §6.™ = z{fc{z\xb))\z=f,+Re ^^; 

5: Define function B(0, v) = V^G(0 + 7 v) where v is a d-dimensional vector; 

6: Using Conjugate Gradient algorithm to solve linear system: ^{6t, Pt) = — 

7: Ot+i=et^Pu 

8 : end for 


ical expression for Hessian-vector multiplication: 


H^v 


—VF(0 + 7v) 


®III 

■ T d{fi{d)+R{e)e) 



® de 

0^0+7v 


7=0 


EAr(o,i) 


d-y 


^{^l{e) + Ii{e)e) 


de 


0-^9-\-'yw y 


( 6 ) 


7=0 


Eq. ([^ is appealing since it does not need to store the dense matrix and provides an unbiased H^v 
estimator with a small sample size. In order to conduct the 2“^ order optimization for variational in¬ 
ference, if the computation of the gradient for variational lower bound is completed, we only need to 
add one extra step for gradient evaluation via Eq. which has the same computational complexity 
as Eq. Q. This leads to a Hessian-free variational inference method described in Algorithmic 


For the worst case of HF, the conjugate gradient (CG) algorithm requires at most d iterations to 
terminate, meaning that it requires d evaluations of H^v product. However, the good news is that 
CG leads to good convergence after a reasonable number of iterations. In practice we found that it 
may not necessary to wait CG to converge. In other words, even if we set the maximum iteration K 
in CG to a small fixed number (e.g., 10 in our experiments, though with thousands of parameters), 
the performance does not deteriorate. The early sloping strategy may have the similar effect of 
Wolfe condition to avoid excessive step size in Newton’s method. Therefore we successfully reduce 
the complexity of each iteration to 0{Kdd‘l), whereas 0{dd‘l) is for one SGD iteration. 


L-BFGS Limited memory BEGS utilizes the information gleaned from the gradient vector to ap¬ 
proximate the Hessian matrix without explicit computation, and we can readily utilize it within 
our framework. The basic idea of BEGS approximates Hessian by an iterative algorithm = 
B( + AG(AG7/A6>tA6>7 - BtA6>(A6>7B(/A6>7BtA6>(, where AG* = Gj - Gt_i and 
AOt = By Eq. Q, the gradient Gf at each iteration can be obtained without any 

difficulty. However, even if this low rank approximation to the Hessian is easy to invert analyti¬ 
cally due to the Sherman-Morrison formula, we still need to store the matrix. L-BFGS will further 
implicitly approximate this dense B^ or B^^ by tracking only a few gradient vectors and a short 
history of parameters and therefore has a linear memory requirement. In general, L-BFGS can per¬ 
form a sequence of inner products with the K most recent AOt and AGt, where iT is a predefined 
constant (10 or 15 in our experiments). Due to the space limitations, we omit the details here but 
none-the-less will present this algorithm in experiments section. 


2.4 Estimator Variance 

The framework of stochastic backpropagation cniniiiiiiia extensively uses the mean of very 
few samples (often just one) to approximate the expectation. Similarly we approximate the left side 
of Eq. Q, ([^, ([^ by sampling few points from the standard normal distribution. However, the 
magnitude of the variance of such an estimator is not seriously discussed. (25\ simply explored the 
variance quantitatively for separable functions. CD merely borrowed the variance reduction tech¬ 
nique from reinforcement learning by centering the learning signal in expectation and performing 
variance normalization. Here, we will generalize the treatment of variance to a broader family, 
Lipschitz continuous function. 
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Theorem 2 (Variance Bound). If f is an L-Lipschitz differentiable function and e ^ A/’(0, 

thenm{^)-nf{^)]?]<^- 


The proof of Theorem [^( see supplementary) employs the properties of sub-Gaussian distributions 
and the duplication trick that are commonly used in learning theory. Significantly, the result implies 
a variance bound independent of the dimensionality of Gaussian variable. Note that from the proof, 
we can only obtain the E ^gA(/(€) E[/(€)])j ^ gL^A^7r^/8 A > 0. Though this result is enough 
to illustrate the variance independence of dz , we can in fact tighten it to a sharper upper bound by 
a constant scalar, i.e. thus leading to the result of Theorem ^with Var(/(e)) < Lf. If 

all the results above hold for smooth (twice continuous and differentiable) functions with Lipschitz 
constant L then it holds for all Lipschitz functions by a standard approximation argument. This 
means the condition can be relaxed to Lipschitz continuous function. 


Corollary 3 (Bias Bound). 


( Ji Em=i - E[/(e)] > 


< 2e~ 


It is also worth mentioning that the significant corollary of Theorem is probabilistic inequality 
to measure the convergence rate of Monte Carlo approximation in our setting. This tail bound, 
together with variance bound, provides the theoretical guarantee for stochastic backpropagation on 
Gaussian variables and provides an explanation for why a unique realization (M = 1) is enough 
in practice. By reparametrization, Eq. 0,(0 can be formulated as the expectation w.r.t the 
isotropic Gaussian distribution with identity covariance matrix leading to Algorithm Thus we 
can rein in the number of samples for Monte Carlo integration regardless dimensionality of latent 
variables z. This seems counter-intuitive. However, we notice that larger L may require more 
samples, and Lipschitz constants of different models vary greatly. 


3 Application on Variational Auto-encoder 

Note that our method is model free. If the loss function has the form of the expectation of a function 
w.r.t latent Gaussian variables, we can directly use Algorithmic In this section, we put the emphasis 
on a standard framework VAE model ini that has been intensively researched; in particular, the 
function endows the logarithm form, thus bridging the gap between Hessian and fisher information 
matrix by expectation (see a survey 12 ^ and reference therein). 

3.1 Model Description 

Suppose we have N i.i.d. observations X = where G is a data vector that can 

take either continuous or discrete values. In contrast to a standard auto-encoder model constructed 
by a neural network with a bottleneck structure, VAE describes the embedding process from the 
prospective of a Gaussian latent variable model. Specifically, each data point x follows a generative 
model p- 0 (x|z), where this process is actually a decoder that is usually constructed by a non-linear 
transformation with unknown parameters and a prior distribution Pt/,(z). The encoder or recog¬ 
nition model q(f){z\x.) is used to approximate the true posterior p^{z\x.), where 0 is similar to the 
parameter of variational distribution. As suggested in csEiiia, multi-layered perceptron (MLP) 
is commonly considered as both the probabilistic encoder and decoder. We will later see that this 
construction is equivalent to a variant deep neural networks under the constrain of unique realization 
for z. Eor this model and each datapoint, the variational lower bound on the marginal likelihood is, 

logp^(xW) > Eq^(^|x(i))[logp^(xW|z)] - DifL(g 0 (z|xW)||p^(z)) = £(xW). ( 7 ) 

We can actually write the KL divergence into the expectation term and denote ('0,0) as 6. 
By the previous discussion, this means that our objective is to solve the optimization problem 
arg max0 >C(x‘^*^) of full dataset variational lower bound. Thus L-BEGS or HE SGVI algorithm 
can be implemented straightforwardly to estimate the parameters of both generative and recognition 
models. Since the first term of reconstruction error appears in Eq. 0 with an expectation form 
on latent variable, imiia used a small finite number M samples as Monte Carlo integration with 
reparameterization trick to reduce the variance. This is, in fact, drawing samples from the stan¬ 
dard normal distribution. In addition, the second term is the KL divergence between the variational 
distribution and the prior distribution, which acts as a regularizer. 
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3.2 Deep Neural Networks with Hybrid Hidden Layers 

In the experiments, setting M = 1 can not only achieve excellent performance but also speed up 
the program. In this special case, we discuss the relationship between VAE and traditional deep 
auto-encoder. For binary inputs, denote the output as y, we have logp- 0 (x|z) = Ylf=i + 

(l — Xj) log(l — yj), which is exactly the negative cross-entropy. It is also apparent that logp-^(x|z) 
is equivalent to negative squared error loss for continuous data. This means that maximizing the 
lower bound is roughly equal to minimizing the loss function of a deep neural network (see Figure 
1 in supplementary), except for different regularizers. In other words, the prior in VAE only im¬ 
poses a regularizer in encoder or generative model, while £2 penalty for all parameters is always 
considered in deep neural nets. From the perspective of deep neural networks with hybrid hidden 
nodes, the model consists of two Bernoulli layers and one Gaussian layer. The gradient computation 
can simply follow a variant of backpropagation layer by layer (derivation given in supplementary). 
To further see the rationale of setting M = 1, we will investigate the upper bound of the Lipschitz 
constant under various activation functions in the next lemma. As Theorem [^implies, the variance 
of approximate expectation by finite samples mainly relies on the Lipschitz constant, rather than 
dimensionality. According to Lemma imposing a prior or regularization to the parameter can 
control both the model complexity and function smoothness. Lemmaalso implies that we can get 
the upper bound of the Lipschitz constant for the designed estimators in our algorithm. 

Lemma 4. For a sigmoid activation function g in deep neural networks with one Gaussian layer z, 
z ~ A/’(/l4, C), C = R^R. Let z = p, Re, then the Lipschitz constant of giWifp + Re) + hi) 
is bounded by |||ITi^R|| 2 , where Wi^ is ith row of weight matrix and hi is the ith element bias. 
Similarly, for hyperbolic tangent or softplus function, the Lipschitz constant is bounded by || ITi,R|| 2 . 


4 Experiments 

We apply our 2“^ order stochastic variational inference to two different non-conjugate models. First, 
we consider a simple but widely used Bayesian logistic regression model, and compare with the 
most recent order algorithm, doubly stochastic variational inference (DSVI) 1^ , designed for 
sparse variable selection with logistic regression. Then, we compare the performance of VAE model 
with our algorithms. 

4.1 Bayesian Logistic Regression 

Given a dataset where each instance G includes the default feature 1 and 

Pi G { — 1,1} is the binary label, the Bayesian logistic regression models the probability of out¬ 
puts conditional on features and the coefficients /3 with an imposed prior. The likelihood and the 
prior can usually take the form as 1^) -^( 0 ? respectively, where g is sigmoid 

function and A is a diagonal covariance matrix for simplicity. We can propose a variational Gaussian 
distribution q{(d\p^ C) to approximate the posterior of regression parameter. If we further assume a 
diagonal C, a factorized form t)oth efficient and practical for inference. Unlike 

iteratively optimizing A and C as in variational EM, 12^ noticed that the calculation of the gra¬ 
dient w.r.t the lower bound indicates the updates of A can be analytically worked out by variational 
parameters, thus resulting a new objective function for the representation of lower bound that only 
relies on C (details refer to |[28ll ). We apply our algorithm to this variational logistic regression on 
three appropriate datasets: DukeBreast and Leukemia are small size but high-dimensional for 
sparse logistic regression, and a9a which is large. See Table[^for additional dataset descriptions. 

Fig. shows the convergence of Gaussian variational lower bound for Bayesian logistic regression 
in terms of running time. It is worth mentioning that the lower bound of HFSGVI converges within 
3 iterations on the small datasets DukeBreast and Leukemia. This is because all data points 
are fed to all algorithms and the HFSGVI uses a better approximation of the Hessian matrix to 
proceed 2^^ order optimization. L-BFGS-SGVI also take less time to converge and yield slightly 
larger lower bound than DSVI. In addition, as an SGD-based algorithm, it is clearly seen that DSVI 
is less stable for small datasets and fiuctuates strongly even at the later optimized stage. For the 
large a 9 a, we observe that HFSGVI also needs 1000 iterations to reach a good lower bound and 
becomes less stable than the other two algorithms. However, L-BFGS-SGVI performs the best 
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Table 1: Comparison on number of misclassification 


Dataset(size: #traiii/test/feature) 

DSVI 

L-BFGS-SGVI 

HFSGVI 

train 

test 

train 

test 

train 

test 

DukeBreast(38/4/7129) 

0 

2 

0 

1 

0 

0 

Leukemia(38/34/7129) 

0 

3 

0 

3 

0 

3 

A9a(32561/16281/123) 

4948 

2455 

4936 

2427 

4931 

2468 



time(s) time(s) time(s) 

Figure 1: Convergence rate on logistic regression (zoom out or see larger figures in supplementary) 

both in terms of convergence rate and the final lower bound. The misclassification report in Table 
reflects the similar advantages of our approach, indicating a competitive predication ability on 
various datasets. Finally, it is worth mentioning that all three algorithms learn a set of very sparse 
regression coefficients on the three datasets (see supplement for additional visualizations). 

4.2 Variational Auto-encoder 

We also apply the 2^^ order stochastic variational inference to train a VAE model (setting M = 1 for 
Monte Carlo integration to estimate expectation) or the equivalent deep neural networks with hybrid 
hidden layers. The datasets we used are images from the Frey Face, Olivetti Face and MNIST. We 
mainly learned three tasks by maximizing the variational lower bound: parameter estimation, images 
reconstruction and images generation. Meanwhile, we compared the convergence rate (running 
time) of three algorithms, where in this section the compared SGD is the Ada version O that is 
recommended for VAE model in ifTTlI^ . The experimental setting is as follows. The initial weights 
are randomly drawn from JV{0, O.Ol^I) or N'{0, O.OOl^I), while all bias terms are initialized as 0. 
The variational lower bound only introduces the regularization on the encoder parameters, so we add 
an £2 regularizer on decoder parameters with a shrinkage parameter 0.001 or 0.0001. The number of 
hidden nodes for encoder and decoder is the same for all auto-encoder model, which is reasonable 
and convenient to construct a symmetric structure. The number is always tuned from 200 to 800 
with 100 increment. The mini-batch size is 100 for L-BFGS and Ada, while larger mini-batch is 
recommended for HF, meaning it should vary according to the training size. 

The detailed results are shown in Fig. |^and[^ Both Hessian-free and L-BFGS converge faster than 
Ada in terms of running time. HFSGVI also performs competitively with respet to generalization 
on testing data. Ada takes at least four times as long to achieve similar lower bound. Theoretically, 
Newton’s method has a quadratic convergence rate in terms of iteration, but with a cubic algorith¬ 
mic complexity at each iteration. However, we manage to lower the computation in each iteration 
to linear complexity. Thus considering the number of evaluated training data points, the 2”^ order 
algorithm needs much fewer step than 1^^ order gradient descent (see visualization in supplementary 
on MNIST). The Hessian matrix also replaces manually tuned learning rates, and the affine invari¬ 
ant property allows for automatic learning rate adjustment. Technically, if the program can run in 
parallel with GPU, the speed advantages of 2“^ order algorithm should be more obvious ED. 

Fig. I^b) and Fig. |^b) are reconstruction results of input images. From the perspective of deep 
neural network, the only difference is the Gaussian distributed latent variables z. By corollary of 
Theorem|^ we can roughly tell the mean /j, is able to represent the quantity of z, meaning this layer 
is actually a linear transformation with noise, which looks like dropout training O. Specifically, 
Olivetti includes 64x64 pixels faces of various persons, which means more complicated models 
or preprocessing ca (e.g. nearest neighbor interpolation, patch sampling) is needed. However, 
even when simply learning a very bottlenecked auto-encoder, our approach can achieve acceptable 
results. Note that although we have tuned the hyperparameters of Ada by cross-validation, the 
best result is still a bunch of mean faces. For manifold learning. Fig. |^c) represents how the 
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(a) Convergence (b) Reconstruction (c) Manifold by Generative Model 

Figure 2: (a) shows how lower bound increases w.r.t program running time for different algorithms; (b) il¬ 
lustrates the reconstruction ability of this auto-encoder model when dz = 20 (left 5 columns are randomly 
sampled from dataset); (c) is the learned manifold of generative model when dz = 2. 


Olivetti Face 



Figure 3: (a) shows running time comparison; (b) illustrates reconstruction comparison without patch sam¬ 
pling, where dz = 100: top 5 rows are original faces. 


learned generative model can simulate the images by HFSGVI. To visualize the results, we choose 
the 2D latent variable z in p- 0 (x|z), where the parameter -0 is estimated by the algorithm. The 
two coordinates of z take values that were transformed through the inverse CDF of the Gaussian 
distribution from equal distance grid (10 x 10 or 20x20) on the unit square. Then we merely use the 
generative model to simulate the images. Besides these learning tasks, denoising, imputation 1251 
and even generalizing to semi-supervised learning ca are possible application of our approach. 

5 Conclusions and Discussion 

In this paper we proposed a scalable 2^^ order stochastic variational method for generative models 
with continuous latent variables. By developing Gaussian backpropagation through reparametriza- 
tion we introduced an efficient unbiased estimator for higher order gradients information. Combin¬ 
ing with the efficient technique for computing Hessian-vector multiplication, we derived an efficient 
inference algorithm (HFSGVI) that allows for joint optimization of all parameters. The algorithmic 
complexity of each parameter update is quadratic w.r.t the dimension of latent variables for both 
and 2“^ derivatives. Furthermore, the overall computational complexity of our 2“^ order SGVI is 
linear w.r.t the number of parameters in real applications just like SGD or Ada. However, HFSGVI 
may not behave as fast as Ada in some situations, e.g., when the pixel values of images are sparse 
due to fast matrix multiplication implementation in most softwares. 

Future research will focus on some difficult deep models such as RNNs IHEtI or Dynamic SBN 
0 . Because of conditional independent structure by giving sampled latent variables, we may con¬ 
struct blocked Hessian matrix to optimize such dynamic models. Another possible area of future 
work would be reinforcement learning (RL) 1201 . Many RL problems can be reduced to compute 
gradients of expectations (e.g., in policy gradient methods) and there has been series of exploration 
in this area for natural gradients. However, we would suggest that it might be interesting to consider 
where stochastic backpropagation fits in our framework and how 2 “^ order computations can help. 

Acknolwedgement This research was supported in part by the Research Grants Council of the 
Hong Kong Special Administrative Region (Grant No. 614513). 
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Appendix 

6 Proofs of the Extending Gaussian Gradient Equations 


Lemma 5. Let /(z) : 


IZ be an integrable and twice differentiable function. The second gradient of 


the expectation of f{z) under a Gaussian distribution A/’(z|/Lt, C) with respect to the mean pi can be expressed 
as the expectation of the Hessian of f{z): 


^ Zi,Zj f i^)] — 2Vc'ijIE,Ar(z|^i.,C) [/(z)] . 

Proof From Bonnet’s theorem |@1, we have 

V;^.Ea^(z:|^,C)[/(z)] = Ea^(z:|^,C)[V;s./(z)]. 

Moreover, we can get the second order derivative, 

[/(z)] = ^ fj-i (^Af{z\fL,C)[^ Zj f i^)]) 

= y’v^.A/'(z|M,C)V.^/(z)dz 

= - J Vz,J\f{z\n,C)Vzjf{z)dz 


(8) 


(9) 


jMi 


Zi= + 00 


Zi = -oo 


z|M,C)V.,/(z)dz- 

= EAr(z|^^,c)[V^i.^,-/(z)] 

= 2VcijEAr(z|M,c)[/(z)], 
where the last euality we use the equation 

Vc.,V(z|m,C) = iv^,.,,A/'(z|/.,C). 


Jm, 


z|M,C)V;5i,z./(z)dz 


( 10 ) 

□ 


Lemma 6. Let f{z) : TZ’^ 


IZ be an integrable and fourth differentiable function. The second gradient 


of the expectation of f(z) under a Gaussian distribution A/’(z|/Lt, C) with respect to the covariance C can be 
expressed as the expectation of the forth gradient of f(z) 


[/(z)] — ^j\f{z\^,c::^)^zi,zj,zk,zif{'^y\- 

Proof. From Price’s theorem m, we have 

[/(z)] = Zi,Zjfi^')]- 


( 11 ) 


( 12 ) 


Vcij,Cfc,,E7v-(z|p.c)[/(z)] = (EAr(z|M.c)[V^,,j^./(z)]j 

= I j Vc,,W(z|M,C)V^.,,^/(z)dz 
= 11 V^,,,W(^lM,C)V^„,^,/(z)dz 

= I J ^f{2\^^,C)Vt^,ZJ,zk,zJiz)dz 

~ Zi,Zj,Zk,zif{^)]- 

In the third equality we use the Eq.(p^ again. For the fourth equality we use the product rule for integrals 
twice. □ 

From Eq. and Eq.(p^ we can straightforward write the second order gradient of interaction term as well: 


V^„Cfc,,EAA(M.C)[/(a)] = [yli,z^,zj{z)] ■ 


(13) 
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7 Proof of Theorem 1 


By using the linear transformation z — fji + Re, where e ~ iV(0, Id^), we can generate samples form any 
Gaussian distribution A/^(/it, C), C = RR^, where /Lt(0), R(0) are both dependent on parameter 0 = 

Then the gradients of the expectation with respect to /it and (or) R is 


VRE;vr(/x,c)[/(z)] 

V^Ea^(^,c)[/(z)] 


VrEa^(o,i) [/(m + Re)] = Ea^(o,i) [eg^] 
VR. jE[eiQk] = Ej^(^o^i)[ejeiHik] 
^(0,1) [^i9k] =Ej^^o^i)[eiHik] 
EAr(o,i) [H] 


where g = {gj}^=i is the gradient of / evaluated at /it + Re, H = {Hij}dzxdz is the Hessian of / evaluated 

at /Lf + Re. 

Furthermore, we write the second order derivatives into matrix form: 

V^,rEaa(m.c)[/(z)] = EAA(o.i)[e'^ ® H], 

VrE^(^,c)[/(z)] = E^(o,i)[(ee^) (g)H]. 


For a particular model, such as deep generative model, fx and C are depend on the model parameters, we denote 
them as 0 = i-^- M ^ Combining Eqj^and Eq[^and using the chain rule we have 


V0,EjV(^.c)[/(z)] 


EAr(,i.C) 


_|_ 1 rjY (h — 

^ 801 2 V 901 


where g and H are the first and second order gradient of /(z) for abusing notation. This formulation involves 
matrix-matrix product, resulting in an algorithmic complexity 0{d‘l) for any single element of 0 w.r.t /(z), 
and 0{dd‘l), 0{d‘^d‘l) for overall gradient and Hessian respectively. 

Considering C = RR^, z = /it + Re, 


VeiE^(^,c)[/(z)] 


EaA(o,i) 


jdn 
' 80 



aRV 

801)_ 


EaA(o,i) 


T T ^R 

® 801801 ^ 


Eor the second order, we have the following separated formulation: 

din 


V0^^0^^EAr(,x,c)[/(z)] - V0^^EAr(o,i) 


dRi 


dOi, ^ dOi, 




— E^(o,i) 




dRjk \ dfii 




^ dOi^ J dOi 
dRki 


- + ^9i 


d’^lij 

dOi^dOi^ 


dOi, 


i 


EAr(0,I) 

dK 


dOi, 

T 


dRij d^Rij 


d0i2 


^,3 


dOi^di^ 


dfjL 

"h^+I 

f— 

w. 

901, ^ ' 

\90i, ) 


dOi^ 
®^Ar(0,I) 


H^ + 


dK 


dOi 


dOi^ \dOi^ 

H 


801^ 90in 


aR T a"R 

801 ^^ ^ 80 i, 80 h^ 


+ Re) ^ + Re) Ta^(/i + Re) 


80 ,. 


+ s 


80, n 


It is noticed that for second order gradient computation, it only involves matrix-vector or vector-vector multi¬ 
plication, thus leading to an algorithmic complexity 0{dl) for each pair of 6. 
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One practical parametrization is C = diag{cri, or R = diag{cri, ...,crcz^}, which will reduce the 

actual second order gradient computation complexity, albeit the same order of Then we have 


V0^EAr(,x,c)[/(z)] 




®^Ar(0,I) 

®^AA(0,I) 




d0i 


T dfjL , / ^ xT dcT 

® ^ Wi 


9m XT d/j. 

d0i, d0,. 


+ e© 


da 

W, 


^ t^_M_ 

dOi^ ^ dOi.dOi^ 


+ eO 


da 


f da 


H eO 


da 

dOi^ 


+ (e 0 g) 


®^Ar(0,I) 


d^a 

dOi^dOi^ _ 

f dfi da \ ^ tt 

TT:^ + e0-— H -^+€0-— 






dOi^ 


+ g 


T ^ ay a^(e0cT) 

dOi^dOi^ dOi^dOi^ 


(14) 


(15) 


where 0 is Hadamard (or element-wise) product, and a = (cri,..., cr^^)^. 

Derivation for Hessian-Free SGVI without 6 Plugging This means (/it, R) is the parameter for variational 
distribution. According the derivation in this section, the Hessian matrix with respect to (/it, R) can represented 

1 Ti Pqj- X {dz + 1) matrix V with the same dimensionality of 


as H;x,r = Ej^(o,i) 


[1,6' 


)H 


[/Lt, R], we also have the Hessian-vector multiplication equation. 


H^,R^ec(V) = Ea^(o,i) 


vec HV 


1 

e 



where uec(-) denotes the vectorization of the matrix formed by stacking the columns into a single column 
vector. This allows an efficient computation both in speed and storage. 


8 Forward-Backward Algorithm for Special Variation Auto-encoder Model 


We illustrate the equivalent deep neural network model (Figure by setting M = 1 in VAE, and derive 
the gradient computation by lawyer-wise backpropagation. Without generalization, we give discussion on the 
binary input and diagonal covariance matrix, while it is straightforward to write the continuous case. For binary 
input, the parameters are {(Wz, 6i)}f=i. 

The feedforward process is as follows: 

he = tanh(fyix + 6i) 

Me = 14^2 he + b2 

ae = exp{(W^ 3 he + ^3)72} 

e - mojdz) 

Z = Me + CTe 0 e 

hd = tanh(W4Z + 64) 

y = sigmoid(VF5hd + 65). 
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Output: y 


Hidden 

decoder 

layer 



W 4 , b 4 


Gaussian 
latent layer 



■iw7;b”ir' 




Hidden 

encoder 

layer 



Wi, bi 


(W 5 , b^), 

(Wg, bg) if X is continuous. 



Input: X 


Figure 4: Auto-encoder Model by Deep Neural Nets. 


Considering the cross-entropy loss function, the backward process for gradient backpropagation computation 
is: 

^5 = X© (1 - y) - (1 - x) ©y 

Vw5 = ^shj, Vf>5 = <55 

^4 = (H/5""55)©(l-hrf©hd) 

Vw4 — ^4Z^, V 64 = ^4 

^3 = 0.5 * [{W 4 S 4 ) 0 (Z - /Lte) + 1 - CTg] 

Vvt^3 = , Vfeg = ^3 

S 2 = W 4 S 4 — IJie 

Vu/2 = Vh2 — ^2 

Si = (fy2'^^2 + VF3'^^3)0(l-he0he) 

Vwi = ^ix , Vbi = ^ 1 . 


Notice that when we compute the differences ^ 2 , ^ 3 , we also include the prior term which acts as the role 
of regularization penalty. In addition, we can add the £2 penalty to the weight matrix as well. The only 
modification is to change the expression of Vw^. by adding XWi, where A is a tunable hyper-parameter. 


9 7^-Operator Derivation 

Define 7l^{f(6)} = ^/(0 + 7 v)| , then H^v = 7l^{VeF(6)}. First, mapping v to 

{'F{Wi},'R{bi}}i=i. Then, we derive the Hv to {'JZ{VWi}^lZ{Vbi}}i=i, where P-operator means take 
derivative with respect to objective function. 

Denote = WiU-\-bi, where u can represent any vector used in neural networks. 

Forward Pass: 

7^{si} = 7^{FFi}x + 7^{6i} ( Since 7^{x} = 0) 

7l{he} = 7^{si} tanh'(si) 

n{^le} = 7^{S2} = 7^{Iy2}he + Iy27^{he} + 7^{52} 

7^{S3} = 7^{Iy3}he + W3n{he} + 7^{53} 

F{cTe} = 7 ^( 83 } exp{s3/2} ( Since (e"^)' = 

7^{z} = 1Z{/j,e} + 7l{ae} O e 

7^{s4} = 7^{W4}z + W47^{z} + 7^{54} 

7l{hd} = 7 ^( 84 } tanh'( 84 ) 

7^{85} = n{W5}hd + W^nihd} + 7^{55} 

"^{y} = 7^{85}sigmoid'(85) 
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Backwards Pass: 


K(Z,y} = K(!^} = 2!||^K(y) 

TZiVs^} = 7l{T>y} © sigmoid'(s5) + Vy 0 sigmoid"(s5) © 7?.{s5} 
n{vw5} = 7e{Ps5}hJ +vs5n{hdf 
TZ{Vh} =TZ{Vs5} 

■R{Dhd} = ■R{W5 }'^Vs 5 + W^TZ{Vsi} 

where the rest can follow the same recursive computation which is similar to gradient derivation. 


10 Variance Analysis (Proof of Theorem 2) 

In this part we analyze the variance of the stochastic estimator. 

Lemma 7. For any convex function f, 

E[0(/(e) - E[/(e)])] < E (|(V/(e), t,))] , (16) 

where e,rj A/’( 0 , Id ^) and e, 77 are independent. 


Proof Using interpolation 'y(uj) — esm(ct;) + ?7COs(cc), then 'y'(u) — ecos(ct;) — ?7sin(ct;), and 7(0) = 
?7,7(7r/2) = e. Furthermore, we have the equation, 


Then 




E,[0(/(e)-E[/(e) 


E,[0(/(e) - E^[/(r,)])] < Ee.„[0(/(e) - fin))] 
E 


< -E 


Jf 4> (|(V/(7(w)),7'(‘^))) dw 


= E [0 (^|(V/(7(a;)),7'(a;)))] do) 

= E[<^(|(V/(e),r,>)]. 


The above two inequalities use the Jensen’s Inequality. The last equation holds because both 7 and 7 ' follow 
A/’( 0 , Id), and £[ 77 '^] = 0 implies they are independent. □ 


Before giving a dimensional free bound, we first let (j){x) = and can obtain a relatively loosen bound 
of variance for our estimators. Assuming / is a L-Lipschitz differentiable function and e ~ A/’(0, Id^), the 
following inequality holds: 

E[(/(e)-E[/(e)])^] < (17) 

To see the reason, we only need to reuse the double sample trick and the expectation of Chi-squared distribu¬ 
tion, we have 


E 


(|(V/(e),»7>)^ 


< 


TT L 


nwvf] = 




Then by Lemma|^ Eq.(18) holds. To get a tighter bound as in Theorem 2, we give the following Lemmaj^and 
Lemma|9]first. 

Lemma 8 ([?]). A random variable X with mean p. = E[X] is sub-Gaussian if there exists a positive number 
a such that for all A G 


then we have 




E [(X < (7^ 
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Proof. By Taylor’s expansion, 


E e 


[gA(X-M)] _ 


= E 




< e 


a^\^l2 


oo 




2H\ ■ 


Thus ^E[(X - nf] < ^ + o(A 2). Let A ^ 0, we have Var(X) < a^. □ 

Lemma 9. If f{x) is a L-lipschitz differentiable function and e G JV(0, Id^ ) then the random variable /(e) — 
E[/(e)] is sub-Gaussian with parameter L, i.e. for all A G 

£ LA(/(e)-E[/(e)])1 ^ 


Proof From Lemma [7] we have 
E I’gAf <v/(6).n) 


=E, 


2^i=i 


/(e)) 


= E, 




= Ee 


\ 2 2 o 

^ I | v /( 6)||2 


< exp 


m- 


□ 


Proof of Theorem 2 Combining Lemmaand Lemmawe complete the proof of Theorem 2. 

In addition, we can also obtain a tail bound, 

Pe~Ar( 0 ,i,j (|/(e) -E[/(6)]| > t) < 2e-^. (18) 

For A > 0, Let ei,..., em be i.i.d random variables with distribution A/^(0, 


M \ f ^ \ 

m=l / \m=l / 


^ p ^gA(E"=i /(e™)-ME[/(e)]) > ^AMt^ 
< E |^e^(E“=l /(e,„)-ME[/(e)])j ^-AMt 


(e[, 


A(/(e^)-E[/(e)]) 


-At V 


According to Lemma R let A = jy, we have P Y.1=i /(^m) - E[/(e)] > t) < e“5^. The other 

side can apply the same trick. Let M = 1 we have Inequality |T^ . Thus Theorem 2 and Inequality |T^ provide 
the theoretical guarantee for stochastic method for Gaussian variables. 


11 More on Conjugate Gradient Descent 

The preconditioned CG is used and theoretically the quantitative relation between the iteration K and relative 
tolerance e is e < exp(—2iG/y^) [?], where c is matrix conditioner. Also the inequality indicates that the 
conditioner c can be nearly as large as 0{K‘^). 


12 Proof of Lemma 4 

Proof Since g{x) = , we have g\x) = g{x)(l - g{x)) < 

l/(e) - /(»7)l = \9{hi{e)) “ 9{hi{vi))\ < \\hi{e) - hi{r))\ < ^||Wi,R|| 2 ||e - »7||2. 

Since tanh(x) = 2g{2x) — 1 and log(l + < 1, the bound is trivial. □ 
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13 Experiments 


All the experiments are conducted on a 3.2GHz CPU computer with X-Intel 32G RAM. For fair comparison, 
the algorithms and datasets we referred to as the baseline remain the same as in the previously cited work and 
software was downloaded from the website of relevant papers. 

Datasets DukeBreast, Leukemia and A 9 a are downloaded from http: / /www. csie . ntu . edu . tw/ 
~c jlin/libsvmtools/datasets/binary . html Datasets Frey Face, Olivetti Face and MNIST are 
downloaded from http : / /www. cs . nyu . edu/ ~roweis/data . html 

13.1 Variational logistic regression 

The optimized lower bound function when the covariance matrix C is diagonal is as following. 



i=l 


where I is the likelihood function. 

The results are shown in Fig. |^and Fig. 


13.2 Variational Auto-encoder 

The results are shown in Fig. |7]and Fig. 


16 



Duke Breast 
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Figure 5: Convergence rate on variational logistic regression: HFSGVI converges within 3 iterations on small 
datasets. 


17 




































Duke Breast regression coefficients 


CD 

CO 

> 



index 


Leukemia regression coefficients 

5|-.-.-.- 



CD 



-10 


- DSVI 

-L-BFGS-SGVI 

- HFSGVI 


-15, 


2000 


4000 

index 


6000 


8000 


A9a regression coefficients 



18 












































MNIST 



(a) Convergenve 


S' s 

t, 

5 

7 

0 4 

d 

7 ? 

7 

8 

7 s 

7 7 

/ 


T 

U U 


S 

s 

5 

5 

5 

3 

3 

3 

3 

3 

3 

3 

3 

2 

3 

3 

/ 

/ 7 


^ 7 

1 

3 7 

7 


7 

3 

S 

9 

3 <7 4 

? 

7 

(» u 


(1 

s 

5 

s 

s 

5 

3 

3 

3 

3 

3 

3 

3 

3 

3 

/ 

/ 


\ 

S' 7 

1 

0 

9 

/ 8 0 

Ic 

V7 

0 4 

4 

1 

U 

U (0 

(0 

6 

G 

6 

s 

s 

5 

5 

3 

3 

3 

3 

3 

3 

3 

/ 

/ 

/ 

7 / 

6 

<=) 9 3 “i 

<£'3 b 

3 

Lp 

/ 


3> 9 P 

7 


(0 (0 

(0 

6 

6 

6 

s 

5 

5 

5 

5 

5 

3 

3 

3 

3 

/ 

/ 

/ 

/ 

53 

/ 


b 

f 0 

1 



1 

3 7 

4 


a u 

(0 

(0 

6 

A 

6 

5 

5 

5 

5 

5 

5 

3 

3 

3 

/ 

/ 

/ 

/ 

9 ^ 

? 

6 / 

.9 

S f 


7- 7 

7 

\ 

q 

(f 

0 Ic 7 

0 


a 6 

u 

(0 

A 

A 

6 

5 

5 

5 

5 

5 

S 

8 

8 

8 

/ 

/ 

/ 

/ 




% 

S 1 

L 

9 


7 ^ 4l[5 7- 7 

a &i[6 

A 

A 

A 


6 

5 

5 

5 

S 

8 

8 

8 

8 

/ 

/ 

/ 

/ 

k / 


V 


S' 

•) 7 


? 

H 

5"-7' 

■ ^7 4 

3l 

a £'i 

A 

A 

A 


6 


5 


8 

8 

8 

8 

8 

/ 

/ 

/ 

/ 


i 

C 1 

■4 7 


g 

y 

V 7>^ 

0 

8 

n 

0 A 

A 

A 

A 

A 




F 

F 8 

8 

8 

8 

3 

t 

/ 

/ 

/ 

0 H 

-7 S 

3 

3 / 

t 

/ 

7 0 

•? 


9 

L 

0 A 

A 

A 

A 

A 



/ y 

8 

8 

9 

9 

9 

t 

/ 

/ 

/ 

g S 

i 5 

7 

0 4 

9 

7 y 

7 

8 

3 

S 

9 7 

/ 

) 

U 

A A 

A 

A 



6 

f 

f g 

f 

8 

r 

9 

9 

9 

1 

/ 

/ 

/ 

/ 7 


^ 7 

\ 

^ 7 

7 


.7 

3 


S 

3 d? 4 

7 

3 

A A 

A 

A 

s 


f 

f 

f 

9 

9 

9 

9 

9 

9 

9 

t 

1 

1 

1 

4 

1 

S' 7 

1 

w 0 

7 


If 

V7 

0 4 

4 

1 

ic 

A A 

A 

A 

f 

f 

f 

f 

9 

9 

9 

9 

9 

9 

9 

9 

1 

1 

t 

1 

7 9 

b 

^ •= 

7 3 3 


■3 

L» 

/ 


^ 9 

P 

7 

5 

\A A S 

A 


r 

f 

f 

9 

9 

9 

9 

9 

9 

9 

9 

9 

1 

1 

1 

5 3 

/ 

f 3 

0 

8 7 

6 

7 0 

1 


V 

t 

3 9 

(j 

4 

C 

\c s s ^ 


y 

9 

y 

y 

9 

9 

9 

9 

9 

9 

9 

9 

1 

1 

1 

9 ^ 

3 

<5 / 

? 

5 7 


7.7 

7 

1 

q 

0 0 lo 7 

0 

n 

c 

\c ^ ^ 


y 

y 

y 

y 

y 

y 

9 

9 

9 

9 

9 

9 

9 

1 

1 

\ 

^ 3 


1 \ 

/ 


V? 

^ 1 

L 

9 

J'7 ^ 4 

s 

4-7 

\<7^. 

y 

V 

y 

y 

y 

y 

y 

9 

9 

9 

9 

9 

9 

9 

7 

7 

1 

7 

/ 




S' 

1 7 . 

r- 

q 

? 

i 

5 --7 

64 

a. 


y 

y 

y 

y 

y 

y 

9 

9 

9 

9 

9 

9 

7 

7 

7 

7 

1 

/ 5 7 5 

f 

C 1 


^ 7l 



/ 


V V 

0 

e 

n 

9 V 

V 

y 

y 

y 

s 

y 

q 

9 

9 

9 

9 

7 

7 

7 

7 

7 

7 

7 

0 ‘1 

7 / 

H 

3 0 g 

3 /■ 

■| 

/ 

7 0 

1 7 

(f 

5 

4 

V 9 


*1 


4 


y 

3 

9 

9 

1 

7 

7 

7 

7 

7 

7 

7 

7 


(b) Reconstruction and Manifold 


Figure 7: (a) Convergence rate in terms of epoch, (b) Manifold Learning of generative model when dz — 2\ 
two coordinates of latent variables z take values that were transformed through the inverse CDF of the Gaussian 
distribution from equal distance grid on the unit square, pe (xlz) is used to generate the images. 
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(a) HFSGVI 


(b) L-BFGS-SGVI 


(c) Ada-SGVI 

Figure 8: Reconstruction Comparison 


20 




